Session 9: Scraping Interactive Web Pages (part 2)

Introduction to Web Scraping and Data Management for Social Scientists

Johannes B. Gruber

2025-07-17

Browser automation

What is Browser Automation?

  • Definition: the process of using software to control web browsers and interact with web elements programmatically
  • Tools Involved: Common tools include Selenium, Puppeteer, and Playwright
  • These tools allow scripts to perform actions like clicking, typing, and navigating through web pages automatically

Common Uses of Browser Automation

  • Testing: Widely used in software development for automated testing of web applications to ensure they perform as expected across different environments and browsers
  • Task Automation: Simplifies repetitive tasks such as form submissions, account setups, or any routine that can be standardized across web interfaces

Browser Automation in Web Scraping

  • Dynamic Content Handling: Essential for scraping websites that load content dynamically with JavaScript. Automation tools can interact with the webpage, wait for content to load, and then scrape the data.
  • Simulation of User Interaction: Can mimic human browsing patterns to interact with elements (like dropdowns, sliders, etc.) that need to be manipulated to access data
  • Avoiding Detection: More sophisticated than basic scraping scripts, browser automation can help mimic human-like interactions, reducing the risk of being detected and blocked by anti-scraping technologies
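In practice, mimicking human-like interactions often just means inserting randomised pauses between actions. A minimal sketch (the function name and defaults are my own, not from any package):

```r
# sketch: pause for a random, human-like interval between automated actions
human_pause <- function(min_s = 1, max_s = 3) {
  wait <- runif(1, min_s, max_s)  # draw a uniform random delay
  Sys.sleep(wait)
  invisible(wait)                 # return the delay invisibly, e.g. for logging
}

w <- human_pause(0.01, 0.02)  # very short delay, just for demonstration
```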

Example: Google Maps

Goal

  1. Check the commute time programmatically
  2. Extract the distance and time it takes to make a journey

Can we use rvest?

static <- read_html("https://www.google.de/maps/dir/Armadale+St,+Glasgow,+UK/Lilybank+House,+Glasgow,+UK/@55.8626667,-4.2712892,14z/data=!3m1!4b1!4m14!4m13!1m5!1m1!1s0x48884155c8eadf03:0x8f0f8905398fcf2!2m2!1d-4.2163615!2d55.8616765!1m5!1m1!1s0x488845cddf3cffdb:0x7648f9416130bcd5!2m2!1d-4.2904601!2d55.8740368!3e0?entry=ttu")
static |> 
  html_elements(".Fk3sm") |> 
  html_text2()
character(0)
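The empty result is the tell-tale sign of client-side rendering: a selector that matches in the browser's inspector finds nothing in the static HTML. A small helper makes this check explicit (a sketch; the function name is mine):

```r
library(rvest)

# TRUE when the selector finds nothing in the static HTML, hinting that
# the element is rendered client-side via JavaScript
missing_in_static <- function(html, css) {
  length(html_elements(html, css)) == 0
}

# demo on a minimal document: "p" exists, ".Fk3sm" does not
doc <- minimal_html("<p>18 min</p>")
missing_in_static(doc, "p")       # FALSE
missing_in_static(doc, ".Fk3sm")  # TRUE
```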

[Figure: Google Maps commute directions]

Let’s use Browser Automation

The new read_html_live from rvest solves this by driving a real (headless) browser via chromote:

# loads a real web browser
sess <- read_html_live("https://www.google.de/maps/dir/Armadale+St,+Glasgow,+UK/Lilybank+House,+Glasgow,+UK/@55.8626667,-4.2712892,14z/data=!3m1!4b1!4m14!4m13!1m5!1m1!1s0x48884155c8eadf03:0x8f0f8905398fcf2!2m2!1d-4.2163615!2d55.8616765!1m5!1m1!1s0x488845cddf3cffdb:0x7648f9416130bcd5!2m2!1d-4.2904601!2d55.8740368!3e0?entry=ttu")

You can have a look at the browser with:

sess$view()

Unfortunately, we do not get content yet. We first have to click on “Accept all”
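Since the session object exposes a click() method, you can also accept the banner from code. Note that the CSS selector below is an assumption: inspect the banner via sess$view() to confirm what it looks like on your machine.

```r
# selector is an assumption; check the consent button in sess$view() first
sess$click("button[aria-label=\"Accept all\"]")
```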

Let’s use Browser Automation

After interacting with the session (here: accepting the cookie banner), you need to read the page into R again:

sess <- read_html_live("https://www.google.de/maps/dir/Armadale+St,+Glasgow,+UK/Lilybank+House,+Glasgow,+UK/@55.8626667,-4.2712892,14z/data=!3m1!4b1!4m14!4m13!1m5!1m1!1s0x48884155c8eadf03:0x8f0f8905398fcf2!2m2!1d-4.2163615!2d55.8616765!1m5!1m1!1s0x488845cddf3cffdb:0x7648f9416130bcd5!2m2!1d-4.2904601!2d55.8740368!3e0?entry=ttu")

Then we can extract information:

# the session behaves like a normal rvest html object
trip <- sess |> 
  html_elements("#section-directions-trip-0")

trip |> 
  html_element("h1") |> 
  html_text2()
[1] "via M8"
trip |> 
  html_element(".fontHeadlineSmall") |> 
  html_text2()
[1] "18 min"
trip |> 
  html_element(".fontBodyMedium") |> 
  html_text2()
[1] "4.1 miles"

Store the cookies

  • Now that we have accepted the cookie banner, a small set of cookies was stored in the browser
  • These are destroyed when we close R, however
  • We can extract them and save the session for the next run, so no manual intervention is necessary
cookies <- sess$session$Network$getCookies()
saveRDS(cookies, "data/chromote_cookies.rds")

In the next run, you can load the cookies with:

cookies <- readRDS("data/chromote_cookies.rds")
sess <- sess$session$Network$setCookies(cookies = cookies$cookies)
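To avoid restoring stale cookies, you can guard the load step with a freshness check (a sketch; the helper name and the 30-day cutoff are my own choices):

```r
# TRUE if the cookie file exists and was written recently enough to reuse;
# the 30-day default is an arbitrary assumption, adjust per site
cookies_usable <- function(path, max_age_days = 30) {
  file.exists(path) &&
    difftime(Sys.time(), file.mtime(path), units = "days") < max_age_days
}

f <- tempfile(fileext = ".rds")
saveRDS(list(), f)
cookies_usable(f)           # TRUE: just written
cookies_usable("no.rds")    # FALSE: file does not exist
```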

Example: Reddit

Goal

  1. Get Reddit posts
  2. Get the timestamp, up-/downvote score, post and comment text

Can we use rvest to get post URLs?

html_subreddit <- read_html("https://www.reddit.com/r/wallstreetbets/")

posts <- html_subreddit |> 
  html_elements("a") |> 
  html_attr("href") |> 
  str_subset("/comments/") |> 
  str_replace_all("^/", "https://www.reddit.com/") |> 
  unique()
posts
[1] "https://www.reddit.com/r/wallstreetbets/comments/1lx6a9h/weekly_earnings_thread_714_718/"          
[2] "https://www.reddit.com/r/wallstreetbets/comments/1m2yava/daily_discussion_thread_for_july_18_2025/"
[3] "https://www.reddit.com/r/wallstreetbets/comments/1m2mdax/open_just_getting_started/"               
[4] "https://www.reddit.com/r/wallstreetbets/comments/1ly582b/14m_54m_in_18_months_85_in_cash_now/"     
[5] "https://www.reddit.com/r/wallstreetbets/comments/1m2qlqy/open_yolo/"                               

This does not look too bad!

Although we only get 5 posts 🤔
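With the URLs in hand, the next step is visiting each post. When doing that in a loop, build in random pauses so as not to hammer the server. A sketch (the function name is mine; `reader` is whatever function fetches one page, e.g. read_html or read_html_live):

```r
library(purrr)

# visit each link with a randomised delay between requests;
# `reader` is swappable so the sketch can be tested without network access
scrape_politely <- function(links, reader, min_s = 2, max_s = 5) {
  map(links, function(l) {
    Sys.sleep(runif(1, min_s, max_s))  # polite, randomised pause
    reader(l)
  })
}

# demonstration with a harmless stand-in instead of a real request
res <- scrape_politely(c("a", "b"), reader = toupper, min_s = 0, max_s = 0.01)
```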

Can we use rvest to get posts?

html_post <- read_html("https://www.reddit.com/r/wallstreetbets/comments/1ehcsoq/daily_discussion_thread_for_august_01_2024/")
post_data <- html_post |> 
  html_elements("shreddit-post")

post_data |> 
  html_attr("created-timestamp") |> 
  lubridate::as_datetime()
[1] "2024-08-01 09:57:12 UTC"
post_data |> 
  html_attr("id")
[1] "t3_1ehcsoq"
post_data |> 
  html_attr("subreddit-id")
[1] "t5_2th52"
post_data |> 
  html_attr("score")
[1] "205"
post_data |> 
  html_attr("comment-count")
[1] "15702"

We actually can!
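Those attribute lookups can be bundled into a tidy one-row tibble. In this sketch, minimal_html() stands in for the real page (the timestamp format here is an assumption) so the snippet is self-contained:

```r
library(rvest)
library(tibble)

# a stand-in for the real page, carrying the attributes shown above
doc <- minimal_html(
  '<shreddit-post id="t3_1ehcsoq" created-timestamp="2024-08-01T09:57:12+00:00"
     score="205" comment-count="15702"></shreddit-post>'
)
post <- html_elements(doc, "shreddit-post")

post_df <- tibble(
  id       = html_attr(post, "id"),
  created  = lubridate::as_datetime(html_attr(post, "created-timestamp")),
  score    = as.integer(html_attr(post, "score")),
  comments = as.integer(html_attr(post, "comment-count"))
)
post_df
```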

Can we use rvest to get comments?

comments <- html_post |> 
  html_elements("shreddit-comment")
comments
{xml_nodeset (0)}

😑

How about read_html_live?

html_post_live <- read_html_live("https://www.reddit.com/r/wallstreetbets/comments/1ehcsoq/daily_discussion_thread_for_august_01_2024/")
comments <- html_post_live |> 
  html_elements("shreddit-comment")
comments
{xml_nodeset (170)}
 [1] <shreddit-comment author="YeezyThoughtMe" entity-filter-id="" thingid="t ...
 [2] <shreddit-comment author="SighRamp" entity-filter-id="" thingid="t1_lg2u ...
 [3] <shreddit-comment author="Taco_01" entity-filter-id="" thingid="t1_lg2x5 ...
 [4] <shreddit-comment author="Taco_01" entity-filter-id="" thingid="t1_lg2wx ...
 [5] <shreddit-comment author="Deadecigs" entity-filter-id="" thingid="t1_lg2 ...
 [6] <shreddit-comment author="Dear-Bet7063" entity-filter-id="" thingid="t1_ ...
 [7] <shreddit-comment author="etka64" entity-filter-id="" thingid="t1_lg1zei ...
 [8] <shreddit-comment author="DoctorMario1000" entity-filter-id="" thingid=" ...
 [9] <shreddit-comment author="[deleted]" entity-filter-id="" thingid="t1_lg4 ...
[10] <shreddit-comment author="Deadecigs" entity-filter-id="" thingid="t1_lg2 ...
[11] <shreddit-comment author="[deleted]" entity-filter-id="" thingid="t1_lg1 ...
[12] <shreddit-comment author="VisualFlop" entity-filter-id="" thingid="t1_lg ...
[13] <shreddit-comment author="LostandConfused2024" entity-filter-id="" thing ...
[14] <shreddit-comment author="Gristle__McThornbody" entity-filter-id="" thin ...
[15] <shreddit-comment author="mercibien1" entity-filter-id="" thingid="t1_lg ...
[16] <shreddit-comment author="Tay_Tay86" entity-filter-id="" thingid="t1_lg1 ...
[17] <shreddit-comment author="etka64" entity-filter-id="" thingid="t1_lg1zll ...
[18] <shreddit-comment author="SquirtDoctor23" entity-filter-id="" thingid="t ...
[19] <shreddit-comment author="Creeper15877" entity-filter-id="" thingid="t1_ ...
[20] <shreddit-comment author="DemandOk5785" entity-filter-id="" thingid="t1_ ...
...

😁

But again, something is missing 🤔

Interacting with the session

html_post_live$view()

Scrolling down as far as possible:

html_post_live$scroll_to(top = 10 ^ 5)

Then clicking the “load_more_comments” button

html_post_live$click("[noun=\"load_more_comments\"]")

This triggers new content to be loaded:

comments <- html_post_live |> 
  html_elements("shreddit-comment")
comments
{xml_nodeset (180)}
 [1] <shreddit-comment author="YeezyThoughtMe" entity-filter-id="" thingid="t ...
 [2] <shreddit-comment author="SighRamp" entity-filter-id="" thingid="t1_lg2u ...
 [3] <shreddit-comment author="Taco_01" entity-filter-id="" thingid="t1_lg2x5 ...
 [4] <shreddit-comment author="Taco_01" entity-filter-id="" thingid="t1_lg2wx ...
 [5] <shreddit-comment author="Deadecigs" entity-filter-id="" thingid="t1_lg2 ...
 [6] <shreddit-comment author="Dear-Bet7063" entity-filter-id="" thingid="t1_ ...
 [7] <shreddit-comment author="etka64" entity-filter-id="" thingid="t1_lg1zei ...
 [8] <shreddit-comment author="DoctorMario1000" entity-filter-id="" thingid=" ...
 [9] <shreddit-comment author="[deleted]" entity-filter-id="" thingid="t1_lg4 ...
[10] <shreddit-comment author="Deadecigs" entity-filter-id="" thingid="t1_lg2 ...
[11] <shreddit-comment author="[deleted]" entity-filter-id="" thingid="t1_lg1 ...
[12] <shreddit-comment author="VisualFlop" entity-filter-id="" thingid="t1_lg ...
[13] <shreddit-comment author="LostandConfused2024" entity-filter-id="" thing ...
[14] <shreddit-comment author="Gristle__McThornbody" entity-filter-id="" thin ...
[15] <shreddit-comment author="mercibien1" entity-filter-id="" thingid="t1_lg ...
[16] <shreddit-comment author="Tay_Tay86" entity-filter-id="" thingid="t1_lg1 ...
[17] <shreddit-comment author="etka64" entity-filter-id="" thingid="t1_lg1zll ...
[18] <shreddit-comment author="SquirtDoctor23" entity-filter-id="" thingid="t ...
[19] <shreddit-comment author="Creeper15877" entity-filter-id="" thingid="t1_ ...
[20] <shreddit-comment author="DemandOk5785" entity-filter-id="" thingid="t1_ ...
...

Automate the scrolling

last_y <- -1
# scroll down as far as possible
while (html_post_live$get_scroll_position()$y > last_y) {
  last_y <- html_post_live$get_scroll_position()$y
  html_post_live$scroll_to(top = 10 ^ 5)
  # html_elements() returns an empty nodeset when the button is absent
  load_more <- html_post_live |> 
    html_elements("[noun=\"load_more_comments\"]") |> 
    length()
  if (load_more > 0) {
    html_post_live$click("[noun=\"load_more_comments\"]")
    html_post_live$scroll_to(top = 10 ^ 5)
  }
  # wait a random number of seconds (between 5 and 15) to mimic a human user
  sec_random <- runif(1, 1, 3) * 5
  message("Scrolled down, waiting: ", round(sec_random, 1), " s")
  Sys.sleep(sec_random)
}

Alternative: Playwright

Introducing Playwright

  • Tool for web testing
  • Testing a website and scraping it is actually quite similar
  • It essentially uses a special version of a web browser that can be controlled through code from different languages
  • Unfortunately, there is no native R package that wraps the API yet (but there is an R package that wraps the Python package)
  • Alternatives you might have heard of: Selenium and Puppeteer

First, install it

We want to use playwrightr, an R package that controls the Playwright Python package. We therefore need three pieces:

  1. The R package: install it with remotes::install_github("JBGruber/playwrightr")
  2. The Python package: we install this into a virtual environment using reticulate
  3. The Playwright executable, which consists of a modified version of Chrome that can be remote controlled

All three steps are done when you run the code below:

if (!rlang::is_installed("playwrightr")) remotes::install_github("JBGruber/playwrightr")
if (!reticulate::virtualenv_exists("r-playwright")) {
  reticulate::virtualenv_install("r-playwright", packages = "playwright")
  playwright_bin <- reticulate::virtualenv_python("r-playwright") |> 
    stringr::str_replace("python", "playwright")
  system(paste(playwright_bin, "install chromium"))
}
reticulate::use_virtualenv("r-playwright")

Control Playwright from R with an experimental package

I did not write the package, but made some changes to make it easier to use.

To get started, we first initialize the underlying Python package and then launch Chromium:

library(reticulate)
library(playwrightr)
pw_init()
chrome <- browser_launch(
  browser = "chromium", 
  headless = !interactive(), 
  # make sure data like cookies are stored between sessions
  user_data_dir = "user_data_dir/"
)

Now we can navigate to a page:

page <- new_page(chrome)
goto(page, "https://www.facebook.com/groups/911542605899621")

When you are in Europe, the page asks for consent to save cookies in your browser.

Getting the page content

Okay, we now see the content. But what about collecting it? We can use several different get_* functions to identify specific elements. But we can also simply get the entire HTML content:

html <- get_content(page)
html

Conveniently, this is already an rvest object, so we can use our familiar tools to get the links of the visible posts. The page uses a role attribute, which I employ here, and I know that links to posts contain “posts”:

post_links <- html |> 
  html_elements("[role=\"link\"]") |> 
  html_attr("href") |> 
  str_subset("posts")
head(post_links)

Collecting Post content

Now we can visit the page of one of these posts and collect the content from it:

post1 <- new_page(chrome)
# go to the page
goto(post1, post_links[1])
post1_html <- get_content(post1)

We can check the content we collected locally:

check_in_browser <- function(html) {
  tmp <- tempfile(fileext = ".html")
  writeLines(as.character(html), tmp)
  browseURL(tmp)
}
check_in_browser(post1_html)

Scraping the content: failure

The site uses a lot of weird, auto-generated classes, making it almost impossible to get content:

author <- post1_html |> 
  html_elements("#_R_laimqaqdfiqapapd5aqH1_ .x193iq5w") |> 
  html_text2() |> 
  head(1)

text <- post1_html |> 
  html_elements("[data-ad-rendering-role=\"story_message\"] #_R_laimqaqdfiqapapd5aqH2_") |> 
  html_text2()

tibble(author, text)

Nevertheless, some success…

Scraping the content: Another attempt

Getting JSON data

json_data <- post1_html |> 
  html_elements("[type=\"application/json\"][data-processed]") |> 
  map(function(x) x |> 
        as.character() |> 
        str_extract("\\{.*\\}") |> 
        jsonlite::fromJSON())

# find the largest JSON blob, which usually contains the post data
which.max(map_dbl(json_data, \(x) as.numeric(object.size(x))))
parse_path <- function(ix) {
  out <- as.list(ix$p)
  out[which(ix$p == as.character(ix$pos))] <- ix$pos[ix$p == as.character(ix$pos)]
  gsub("list(", "purrr::pluck(DATA, ", deparse1(out), fixed = TRUE)
}

#' Search a list
#'
#' @param l a list
#' @param f a function to identify the element you are searching
#'
#' @return an object containing the searched element with the function to extract it as a name
#' @export
list_search <- function(l, f) {
  
  paths <- rrapply::rrapply(
    object = l,
    condition = f,
    f = function(x, .xparents, .xname, .xpos) list(p = .xparents, n = .xname, pos = .xpos),
    how = "flatten"
  )
  
  out <- purrr::map(paths, function(p) purrr::pluck(l, !!!p$pos))
  names(out) <- purrr::map_chr(paths, parse_path)
  return(out)
}
list_search(json_data, function(x) str_detect(x, "Nabídka spolehlivé flotily pro řidiče s vlastním"))
purrr::pluck(json_data, 71L, "require", 1L, 4L, 1L, "__bbox", "require", 111L, 4L, 2L, "__bbox", "result", "data", "node", "comet_sections", "content") |> 
  View()

Mission: success (?)

Not sure how scalable this is or how stable. But it seems like we got the data for one post, at least. After getting the content you want (or not), we can close the page:

close_page(post1)

What is cool about Playwright

Playwright ships with a code generator: it opens a browser, records your clicks and typing, and writes a matching script. Launch it with:
playwright_bin <- reticulate::virtualenv_python("r-playwright") |> 
  stringr::str_replace("python", "playwright")
system(paste(playwright_bin, "codegen"))

This can produce Python scripts

from playwright.sync_api import Playwright, sync_playwright
def run(playwright: Playwright) -> None:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://www.google.com/")
    page.get_by_role("button", name="Accept all").click()
    page.get_by_label("Search", exact=True).click()
    page.get_by_label("Search", exact=True).fill("amsterdam bijstand")
    page.get_by_role("link", name="Bijstandsuitkering Gemeente").click()

    # ---------------------
    context.close()
    browser.close()


with sync_playwright() as playwright:
    run(playwright)

Summary: Browser Automation

What is it

  • remote control a browser to perform pre-defined steps
  • several available tools:
    • native R: chromote, used in read_html_live
    • native Python: Playwright, can be used from R with playwrightr (experimental)
    • native Java: Selenium, can be used from R with RSelenium (buggy and outdated)
    • native JavaScript: Puppeteer, no R bindings

What are they good for?

  • Get content from pages which you can’t otherwise access
  • Load more content through automated scrolling on dynamic pages
  • Automate tasks like downloading files

Issues

  • Companies have mechanisms to counter scraping:
    • rate limiting requests per second/minute/day and user/IP (Twitter)
    • captchas (can be solved but quite complex)
  • Won’t get you around very obscure HTML code (Facebook)
  • Quite heavy and very slow compared to plain HTTP requests

Wrap Up

Save some information about the session for reproducibility.

sessionInfo()
R version 4.5.1 (2025-06-13)
Platform: x86_64-pc-linux-gnu
Running under: EndeavourOS

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.12.0 
LAPACK: /usr/lib/liblapack.so.3.12.0  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Berlin
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rvest_1.0.4     httr2_1.2.0     lubridate_1.9.4 forcats_1.0.0  
 [5] stringr_1.5.1   dplyr_1.1.4     purrr_1.1.0     readr_2.1.5    
 [9] tidyr_1.3.1     tibble_3.3.0    ggplot2_3.5.1   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] rappdirs_0.3.3    generics_0.1.4    xml2_1.3.8        lattice_0.22-7   
 [5] stringi_1.8.7     hms_1.1.3         digest_0.6.37     magrittr_2.0.3   
 [9] evaluate_1.0.4    grid_4.5.1        timechange_0.3.0  fastmap_1.2.0    
[13] Matrix_1.7-3      jsonlite_2.0.0    processx_3.8.6    chromote_0.5.1   
[17] ps_1.9.1          promises_1.3.2    httr_1.4.7        selectr_0.4-2    
[21] scales_1.3.0      cli_3.6.5         rlang_1.1.6       munsell_0.5.1    
[25] withr_3.0.2       yaml_2.3.10       tools_4.5.1       tzdb_0.5.0       
[29] colorspace_2.1-1  reticulate_1.42.0 curl_6.4.0        png_0.1-8        
[33] vctrs_0.6.5       R6_2.6.1          lifecycle_1.0.4   pkgconfig_2.0.3  
[37] pillar_1.11.0     later_1.4.2       gtable_0.3.6      glue_1.8.0       
[41] Rcpp_1.1.0        xfun_0.52         tidyselect_1.2.1  rstudioapi_0.17.1
[45] knitr_1.50        websocket_1.4.4   htmltools_0.5.8.1 rmarkdown_2.29   
[49] compiler_4.5.1